# Visual Language Understanding
## Blip Arabic Flickr 8k

**Author:** omarsabri8756 · **License:** MIT · **Task:** Image-to-Text · **Tags:** Transformers, multilingual · **Downloads:** 56 · **Likes:** 1

An Arabic image captioning model based on the BLIP architecture and fine-tuned on the Flickr8k Arabic dataset.

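For context, a minimal captioning sketch with the transformers BLIP classes is shown below. The repo id, image path, and generation settings are hypothetical placeholders, not values taken from this listing.

```python
# Hedged sketch: Arabic image captioning with a BLIP checkpoint via transformers.
# The repo id and image path are hypothetical placeholders.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "omarsabri8756/blip-arabic-flickr-8k"  # hypothetical repo id
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
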
## Skywork R1V2 38B

**Author:** Skywork · **License:** MIT · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 1,778 · **Likes:** 105

Skywork-R1V2-38B is positioned as a state-of-the-art open-source multimodal reasoning model, reporting strong results across multiple benchmarks with robust visual reasoning and text comprehension capabilities.

## Emova Qwen 2 5 3b

**Author:** Emova-ollm · **License:** Apache-2.0 · **Task:** Multimodal Fusion · **Tags:** Transformers, multilingual · **Downloads:** 25 · **Likes:** 2

EMOVA is an end-to-end omni-modal large language model that supports visual, auditory, and speech functions, capable of generating text and speech responses with emotional control.

## VL Rethinker 7B Fp16

**Author:** mlx-community · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, English · **Downloads:** 17 · **Likes:** 0

A multimodal vision-language model converted from Qwen2.5-VL-7B-Instruct, supporting visual question answering tasks.

## Qwen2.5 VL 7B Instruct Gptqmodel Int8

**Author:** wanzhenchn · **License:** MIT · **Task:** Image-to-Text · **Tags:** Transformers, multilingual · **Downloads:** 101 · **Likes:** 0

A vision-language model based on Qwen2.5-VL-7B-Instruct with GPTQ-INT8 quantization.

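A hedged loading sketch with transformers is shown below. It assumes a transformers release with Qwen2.5-VL support and an installed GPTQ backend; the repo id and image path are hypothetical placeholders.

```python
# Hedged sketch: running a GPTQ-INT8 Qwen2.5-VL checkpoint with transformers.
# Requires a transformers version with Qwen2.5-VL support plus a GPTQ backend;
# the repo id and image path are hypothetical placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "wanzhenchn/Qwen2.5-VL-7B-Instruct-GPTQ-Int8"  # hypothetical repo id
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open("example.jpg").convert("RGB")
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```
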
## Qwen2.5 VL 72B Instruct GGUF

**Author:** Mungert · **License:** Other · **Task:** Image-to-Text · **Tags:** English · **Downloads:** 2,798 · **Likes:** 5

Qwen2.5-VL-72B-Instruct is a 72B-parameter multimodal large model for vision-language tasks, capable of understanding images and generating text about them.

## Rexseek 3B

**Author:** IDEA-Research · **License:** Other · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 186 · **Likes:** 4

An image-to-text model that processes combined image and text inputs and generates corresponding text outputs.

## Qwen2 VL 7B Captioner Relaxed GGUF

**Author:** r3b31 · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** English · **Downloads:** 321 · **Likes:** 1

A GGUF-format conversion of Qwen2-VL-7B-Captioner-Relaxed, optimized for image-to-text tasks and runnable with tools such as llama.cpp and KoboldCpp.

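A hedged smoke-test sketch with llama-cpp-python is shown below. The GGUF file name is a placeholder, and this minimal check only exercises text generation; captioning an image additionally requires the model's mmproj projector file and a multimodal-capable front end such as a recent llama.cpp build or KoboldCpp.

```python
# Hedged sketch: loading the GGUF conversion with llama-cpp-python for a quick
# text-only smoke test. The file path is a placeholder; image input additionally
# needs the matching mmproj projector file and a multimodal front end.
from llama_cpp import Llama

llm = Llama(model_path="Qwen2-VL-7B-Captioner-Relaxed.Q4_K_M.gguf", n_ctx=4096)
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(result["choices"][0]["message"]["content"])
```
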
## Deepseer R1 Vision Distill Qwen 1.5B Google Vit Base Patch16 224

**Author:** mehmetkeremturkcan · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 25 · **Likes:** 2

DeepSeer is a vision-language model built on DeepSeek-R1 that supports chain-of-thought reasoning and is trained with dialogue templates for vision models.

## Emu3 Stage1

**Author:** BAAI · **License:** Apache-2.0 · **Task:** Text-to-Image · **Tags:** Transformers · **Downloads:** 1,359 · **Likes:** 26

Emu3 is a multimodal model developed by the Beijing Academy of Artificial Intelligence, trained solely with next-token prediction and supporting image, text, and video processing.

## Llama 3 EvoVLM JP V2

**Author:** SakanaAI · **Task:** Image-to-Text · **Tags:** Transformers, Japanese · **Downloads:** 475 · **Likes:** 20

Llama-3-EvoVLM-JP-v2 is an experimental general-purpose Japanese vision-language model that supports interleaved text and image input, created using an evolutionary model merging approach.

## Cephalo Idefics 2 Vision 10b Alpha

**Author:** lamm-mit · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, Other · **Downloads:** 137 · **Likes:** 1

Cephalo is a series of vision-focused large language models (V-LLMs) for multimodal materials science, designed to integrate visual and linguistic data to support advanced understanding and interaction in human-computer or multi-agent AI frameworks.

## Open Llava Next Llama3 8b

**Author:** Lin-Chen · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 323 · **Likes:** 26

An open-source chatbot fine-tuned end-to-end on open-source data, intended for research on multimodal models and chatbots.

## Cephalo Idefics 2 Vision 8b Alpha

**Author:** lamm-mit · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, Other · **Downloads:** 150 · **Likes:** 1

Cephalo is a series of vision-focused large language models (V-LLMs) for multimodal materials science, designed to integrate visual and linguistic data to support advanced understanding and interaction in human-computer or multi-agent AI frameworks.

## Llava Jp 1.3b V1.1

**Author:** toshi456 · **Task:** Image-to-Text · **Tags:** Transformers, Japanese · **Downloads:** 90 · **Likes:** 11

LLaVA-JP is a Japanese-capable multimodal vision-language model that understands input images and generates descriptions of, and dialogue about, them.

## Image Model

**Author:** Mouwiya · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 15 · **Likes:** 0

A Transformers-based image-to-text model; its specific capabilities are not documented in further detail.

## Llava V1.5 13b Dpo Gguf

**Author:** antiven0m · **Task:** Image-to-Text · **Downloads:** 30 · **Likes:** 0

LLaVA-v1.5-13B-DPO is a vision-language model based on the LLaVA framework, trained with Direct Preference Optimization (DPO) and converted to the GGUF quantized format to improve inference efficiency.

## Llava V1.6 34b

**Author:** liuhaotian · **License:** Apache-2.0 · **Task:** Image-to-Text · **Downloads:** 9,033 · **Likes:** 351

LLaVA is an open-source multimodal chatbot fine-tuned on top of a large language model, supporting interaction over both images and text.

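A hedged inference sketch with transformers is shown below. The original liuhaotian weights ship in the LLaVA project's own format; the sketch assumes a transformers-converted variant such as llava-hf/llava-v1.6-34b-hf (an assumption to verify, along with the considerable VRAM requirements) and a recent transformers release.

```python
# Hedged sketch: image-grounded chat with a LLaVA-1.6 checkpoint via transformers.
# The repo id below is an assumed transformers-converted variant, not the
# original liuhaotian checkpoint; the image path is a placeholder.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-34b-hf"  # assumed converted checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg").convert("RGB")
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown here?"}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
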
## Moe LLaVA StableLM 1.6B 4e

**Author:** LanguageBind · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 125 · **Likes:** 8

MoE-LLaVA is a large-scale vision-language model with a mixture-of-experts architecture, achieving efficient multimodal learning through sparsely activated parameters.

## Tiny Llava V1 Hf

**Author:** bczhou · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, multilingual · **Downloads:** 2,372 · **Likes:** 57

TinyLLaVA is a compact large multimodal model framework focused on vision-language tasks, offering strong performance with a small parameter count.

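A hedged sketch using the transformers image-to-text pipeline is shown below. The repo id, prompt format, and image path are assumptions; check the model card for the exact template the checkpoint expects.

```python
# Hedged sketch: TinyLLaVA-style inference via the transformers image-to-text
# pipeline. Repo id, prompt format, and image path are assumed placeholders.
from transformers import pipeline

pipe = pipeline("image-to-text", model="bczhou/tiny-llava-v1-hf")  # assumed repo id
prompt = "USER: <image>\nDescribe the picture briefly.\nASSISTANT:"
result = pipe("example.jpg", prompt=prompt, generate_kwargs={"max_new_tokens": 60})
print(result[0]["generated_text"])
```
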
## Llava 7B Lightening V1 1

**Author:** mmaaz60 · **Task:** Large Language Model · **Tags:** Transformers · **Downloads:** 1,736 · **Likes:** 10

LLaVA-Lightning-7B is a multimodal model based on LLaMA-7B that achieves efficient vision-language processing through delta-parameter tuning.

## Pix2struct Ocrvqa Base

**Author:** google · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, multilingual · **Downloads:** 38 · **Likes:** 1

Pix2Struct is a visual question answering model fine-tuned for OCR-VQA tasks; it can parse textual content in images and answer questions about it.

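A hedged OCR-VQA sketch with the transformers Pix2Struct classes is shown below. The repo id is inferred from the listing's author and model name and should be verified; the image path and question are placeholders.

```python
# Hedged sketch: OCR-VQA inference with Pix2Struct via transformers.
# The repo id is inferred from the listing (verify on the Hub); the image path
# and question are placeholders. For VQA-tuned Pix2Struct checkpoints the
# question is passed as the `text` argument and rendered onto the image.
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

model_id = "google/pix2struct-ocrvqa-base"  # inferred repo id
processor = Pix2StructProcessor.from_pretrained(model_id)
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)

image = Image.open("book_cover.jpg").convert("RGB")
question = "Who is the author of this book?"
inputs = processor(images=image, text=question, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```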